Abstract

Causal inference in cue combination is the problem of deciding whether the
cues have a single cause or multiple causes. Although the Bayesian causal
inference model successfully explains causal inference in cue combination,
it remains unclear how this inference could be implemented by neural
circuits. The existing method, which calculates the log posterior ratio with
variable elimination, is unrealistic and task-specific.
In this paper, we take advantage of the special structure of the Bayesian
causal inference model and propose a hierarchical inference algorithm based
on importance sampling. A simple neural circuit is designed to implement
the proposed inference algorithm. Theoretical analyses and experimental
results demonstrate that our algorithm converges to the accurate value as
the sample size goes to infinity. Moreover, the neural circuit we design can
be easily generalized to implement inference for other problems, such as
multi-stimuli causal inference and the same-different judgment.

Keywords: Causal inference, importance sampling, cue combination, 
neural circuit 


1. Introduction 

The human brain receives cues from multiple sensory modalities and
integrates them in an optimal way [1]. The cues from the outside world are
noisy observations of stimuli and therefore carry uncertainty. It has been
demonstrated that, if all cues have the same cause, the optimal process of
cue combination is a process of Bayesian inference. However, in daily life we
receive information from various sources simultaneously, which means the
cues may come from different causes. How to decide whether a single cause or
multiple causes are responsible for the cues, known as causal inference in
cue combination, is an important problem. This problem is the precondition
of cue combination and is quite common in daily life. For example, at a
cocktail party, we need to decide whether the face and voice belong to the
person who calls our name [8]. Recently, the problem of causal inference in
cue combination has been partially answered by Körding et al. [9] and Sato
et al., who propose the Bayesian causal inference model. Their model
successfully explains causal inference in cue combination. Yet it remains
unclear how causal inference in cue combination could be implemented by
neural circuits. Solving this problem benefits not only theoretical research
but also practical applications. On the one hand, causal inference is the
basis for cue combination. On the other hand, if causal inference could be
implemented by neural circuits, those circuits could be used to perform
causal inference in cue combination for robots.

Over the past decade, several methods with different probability codes
have been proposed to perform probabilistic inference with neural circuits.
Rao [13] establishes the relationship between the dynamic equation of neural
circuits and inference in probabilistic graphical models. He proves that the
evolution over time of the firing rates of neurons in a recurrent neural
circuit is a process of posterior inference in a hidden Markov model, under
the condition that the firing rate is proportional to the log of the
posterior probabilities. Ott and Stoop build the relationship between the
dynamical equation of the continuous Hopfield network and belief propagation
on a binary Markov random field. Sampling is another commonly accepted way
to perform inference with neural circuits. Based on Monte Carlo sampling,
Huang and Rao build a spiking network model to perform approximate inference
for any hidden Markov model. Maass et al. propose that stochastic networks
of spiking neurons could implement inference for graphical models by Markov
chain Monte Carlo. Shi and Griffiths [19] apply importance sampling to
perform inference in chain Bayesian models and design neural circuits to
implement it. Another important framework is probabilistic population coding
(PPC), the core idea of which is that neurons encode distributions rather
than the values of variables. Ma et al. [20] show that the inference for cue
integration can be conducted simply by linear combinations of population
activities with PPC. The method is exploited thereafter by Beck et al. [23]
to realize Bayesian decision making and marginalization.

To the best of our knowledge, the only work implementing causal inference
in cue combination with neural circuits is that of Ma et al. [25] in 2013.
They calculate the ratio of the posterior probabilities of the two situations
(a single cause or multiple causes) with variable elimination and then design
a neural circuit to implement it. This method suffers from three
shortcomings. Firstly, the circuits they design are task-specific and only
work for two stimuli. If we want to implement multi-stimuli causal inference
[23] with the same method, the circuit will be completely different.
Moreover, the required number of operations grows faster than linearly with
the number of stimuli, which makes the neural circuit unrealistic [25].
Secondly, it is hard to generalize the circuit to implement a similar task
called same-different judgment. Thirdly, since it remains unknown how to
implement logarithmic operations with neurons, approximations are made in
their neural circuit, so that it can only produce near-optimal results.

In this paper, instead of calculating the posterior ratio with variable
elimination, we propose a hierarchical inference algorithm based on
importance sampling, which takes advantage of the special structure of the
causal inference model. A neural circuit with a hierarchical structure is
then designed corresponding to the bottom-up inference process. The proposed
method has three advantages. Firstly, the neural circuit is simple and can
easily be realized by PPC and some simple, plausible neural operations.
Secondly, it is easy to generalize this neural circuit to implement inference
for other problems, such as multi-stimuli causal inference and the
same-different judgment. Thirdly, a theoretical proof is given that the
sampling-based method converges to the accurate value with probability one
as the sample size tends to infinity.

The rest of this paper is organized as follows. Section 2 briefly reviews
causal inference in cue combination. In Section 3 we present a
sampling-based inference algorithm and design the corresponding neural
circuit. The experimental results are shown in Section 4. We generalize our
method to solve two other problems in Section 5 and conclude in Section 6.

Fig. 1. The causal inference model in cue combination.

2. The Causal Inference Model In Cue Combination 

The problem of causal inference in cue combination is to infer whether the
cues come from a single cause or from multiple causes. Körding et al. [9]
and Sato et al. each propose a causal inference model of cue combination,
which successfully explains physiological and psychological experiments.
Here we briefly review this model; the stimuli considered here include only
visual and auditory ones. The multi-stimuli problem will be explained in
Section 5. In Fig. 1, node $C$ represents the common-cause variable, $S$,
$S_1$ and $S_2$ denote the stimuli, and $X_1$ and $X_2$ are the cues received
by the sensory system. The state of cause $C$ is 1 or 2, where $C = 1$ means
the cues have the same cause and $C = 2$ means the cues have two different
causes. For simplicity, we assume that $P(C = 1) = P(C = 2) = 0.5$. When
$C = 1$, there is a stimulus $S$ with distribution $P(S)$ corresponding to
the common cause, where $P(S)$ is a Gaussian distribution with mean 0 and
variance $\sigma_s^2$. Two measurements $X_1$ and $X_2$ are generated from
two Gaussian distributions with different variances $\sigma_1^2$ and
$\sigma_2^2$ but with the same mean $S$. When $C = 2$, there are two
different stimuli $S_1$ and $S_2$, which are drawn from the same Gaussian
distribution with mean 0 and variance $\sigma_s^2$. Then the two measurements
$X_1$ and $X_2$ are drawn from two different Gaussian distributions with
means $S_1$ and $S_2$ and variances $\sigma_1^2$ and $\sigma_2^2$,
respectively. Based on the definitions above, the causal inference problem is
to decide whether $C = 1$ or $C = 2$ according to the measurements $X_1$ and
$X_2$.

Fig. 2. The three-layer Bayesian network equivalent to the causal inference
model in Fig. 1.


3. Sampling-Based Causal Inference 

In this section, we first convert the causal inference model to a three-layer
Bayesian network. Then we propose a sampling-based hierarchical inference
method and design the corresponding neural circuit. We demonstrate that
this circuit can be realized by PPC and simple plausible neural operations.

3.1. The three-layer Bayesian network model 

In this paper, the problem is to infer the state of node $C$. In order to
simplify inference, we convert the causal inference model above to a
three-layer Bayesian model (Fig. 2) with appropriate prior probabilities and
conditional probabilities. In the new model, node $C$ is the common-cause
variable, as in the causal inference model. $S_1$ and $S_2$ refer to two
different stimuli, such as visual and auditory stimuli. The conditional
probability of $S_1$ and $S_2$ given $C$ is written $P(S_1, S_2 \mid C)$. We
define

\[
P(S_1, S_2 \mid C = 1) = \delta(S_1 - S_2)\,\frac{1}{\sqrt{2\pi}\,\sigma_s}
\exp\!\left(-\frac{S_1^2}{2\sigma_s^2}\right)
\]

and

\[
P(S_1, S_2 \mid C = 2) = \frac{1}{2\pi\sigma_s^2}
\exp\!\left(-\frac{S_1^2 + S_2^2}{2\sigma_s^2}\right),
\]

where $\delta(S_1 - S_2)$ is the Dirac delta distribution. $X_1$ and $X_2$
are measurements of $S_1$ and $S_2$ respectively. The conditional probability
of $X_1$ given $S_1$ is defined by

\[
P(X_1 \mid S_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}
\exp\!\left(-\frac{(X_1 - S_1)^2}{2\sigma_1^2}\right)
\]

and the conditional probability of $X_2$ given $S_2$ is defined by

\[
P(X_2 \mid S_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2}
\exp\!\left(-\frac{(X_2 - S_2)^2}{2\sigma_2^2}\right).
\]

It is easy to verify that this Bayesian network is equivalent to the causal
inference model in Fig. 1.


3.2. Sampling-based inference algorithm 

Several methods have been developed to perform inference for the Bayesian
model shown in Fig. 2, such as belief propagation (BP) [28] and Markov chain
Monte Carlo (MCMC) [22], all of which can be implemented by neural circuits.
However, all of these circuits have the shortcoming of being task-specific.
Specifically, the neural circuit for belief propagation [30, 31] requires
pools of spiking neurons to represent the function nodes of the factor graph.
It is hard to generalize the circuit for the Bayesian model in Fig. 2 to
implement multi-stimuli causal inference. Similarly, the neural circuit based
on MCMC must meet the neural computability condition (NCC), and the circuit
will be completely different for multi-stimuli causal inference. In this
paper, we aim to build a general-purpose neural circuit for causal inference
in cue combination. Here we utilize importance sampling to perform inference.
Importance sampling is a Monte Carlo method in statistics that is used to
estimate intractable integrals by random sampling. Different from other Monte
Carlo methods, importance sampling generates samples from a simple
distribution rather than from the original distribution [32, 33]. Here we
give a simple example.


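\[
\begin{aligned}
E\big(f(X)\big)_{P(X)} &= \int_X f(X)\,P(X)\,dX \\
&= \int_X \frac{f(X)\,P(X)}{g(X)}\,g(X)\,dX \\
&= E\!\left(\frac{f(X)\,P(X)}{g(X)}\right)_{g(X)} \\
&\approx \frac{1}{m}\sum_{i=1}^{m} \frac{f(X_i)\,P(X_i)}{g(X_i)},
\qquad X_i \sim g(X).
\end{aligned}
\tag{1}
\]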


In equation (1), the goal is to calculate the mathematical expectation
of $f(X)$, where $X$ follows the distribution $P(X)$. There are cases where
we cannot sample from the original distribution $P(X)$ of variable $X$
directly. Instead, we can calculate the expectation of $f(X)P(X)/g(X)$ with
$X$ following the simple distribution $g(X)$. Note that in regions where
$g(X)$ is larger, the sampling points are denser, which means the samples
there are more important.
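As an illustration of equation (1), the following minimal sketch estimates an expectation by sampling from a proposal instead of the target. The concrete choices here are assumptions made for the example only and are not taken from the paper: the target $P$ is a standard Gaussian, the proposal $g$ is a wider Gaussian, $f(X) = X^2$, and the helper names `normal_pdf` and `importance_estimate` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(f, target_pdf, proposal_pdf, proposal_sampler, m, rng):
    """Estimate E_P[f(X)] as (1/m) * sum of f(X_i) P(X_i) / g(X_i), X_i ~ g."""
    total = 0.0
    for _ in range(m):
        x = proposal_sampler(rng)
        total += f(x) * target_pdf(x) / proposal_pdf(x)
    return total / m


rng = random.Random(0)
estimate = importance_estimate(
    f=lambda x: x * x,                              # illustrative f(X) = X^2
    target_pdf=lambda x: normal_pdf(x, 0.0, 1.0),   # assumed target P(X)
    proposal_pdf=lambda x: normal_pdf(x, 0.0, 2.0), # assumed proposal g(X)
    proposal_sampler=lambda r: r.gauss(0.0, 2.0),
    m=100_000,
    rng=rng,
)
print(estimate)  # should be close to the exact value E[X^2] = 1
```

The proposal is deliberately wider than the target so that the ratio $P(X)/g(X)$ stays bounded, which keeps the variance of the estimate small.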
Using importance sampling to perform inference has a neural basis. The
responses of neurons have been interpreted as Monte Carlo samples by Hoyer
and Hyvärinen, which means that the state of each neuron is drawn randomly
from a particular distribution. Shi and Griffiths [19] have used importance
sampling to perform inference in chain Bayesian networks. By adopting the
idea of hierarchical inference, we generalize importance sampling to our
model. Moreover, we will prove the convergence of the sampling-based
inference method. We first consider a Bayesian network with only two nodes
$A$ and $B$, where $A$ is the parent node of $B$. It is easy to obtain the
conditional expectation of $A$ given $B$ with importance sampling:


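\[
\begin{aligned}
E\big(f(A) \mid B\big) &= \sum_A f(A)\,P(A \mid B)
= \frac{\sum_A f(A)\,P(A \mid B)\,P(B)}{P(B)} \\
&= \frac{\sum_A f(A)\,P(B \mid A)\,P(A)}{\sum_A P(B \mid A)\,P(A)}
= \frac{E\big(f(A)\,P(B \mid A)\big)_{P(A)}}{E\big(P(B \mid A)\big)_{P(A)}} \\
&\approx \frac{\sum_{i} f(A^i)\,P(B \mid A^i)}{\sum_{i} P(B \mid A^i)}
= \sum_{i} f(A^i)\,\frac{P(B \mid A^i)}{\sum_{j} P(B \mid A^j)},
\qquad A^i \sim P(A).
\end{aligned}
\tag{2}
\]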


In equation (2), the approximation holds because we use importance sampling
to estimate the expectation. $A^i \sim P(A)$ means that the sample $A^i$ is
drawn from the distribution $P(A)$. Here $A$ is discrete; the sums are
replaced by integrals when $A$ is continuous. It should be noted that the
same samples $A^i$ are used in both sums. When we calculate the last term in
(2), we first calculate the normalizing sum in the denominator and then
calculate the weighted sum in the numerator. Equation (2) can be generalized
to solve the inference problem in Fig. 2.
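Before the full model is treated, the two-node estimator of equation (2) can be sketched on a toy network. Everything concrete below is an assumption for illustration: the three-state prior `PRIOR_A`, the likelihood table `LIK_B1`, and the function names are hypothetical and do not come from the paper.

```python
import random

# Hypothetical toy network A -> B with A in {0, 1, 2} and binary B.
PRIOR_A = [0.5, 0.3, 0.2]   # P(A = a)
LIK_B1 = [0.1, 0.5, 0.9]    # P(B = 1 | A = a)


def exact_conditional_mean(f):
    """Exact E(f(A) | B = 1), obtained by summing over all states of A."""
    numerator = sum(f(a) * LIK_B1[a] * PRIOR_A[a] for a in range(3))
    denominator = sum(LIK_B1[a] * PRIOR_A[a] for a in range(3))
    return numerator / denominator


def sampled_conditional_mean(f, n, rng):
    """Equation (2): draw A^i ~ P(A) and weight each sample by P(B = 1 | A^i)."""
    samples = rng.choices([0, 1, 2], weights=PRIOR_A, k=n)
    weights = [LIK_B1[a] for a in samples]   # the same samples feed both sums
    normalizer = sum(weights)                # denominator of equation (2)
    return sum(f(a) * w for a, w in zip(samples, weights)) / normalizer


rng = random.Random(1)
exact = exact_conditional_mean(lambda a: a)
approx = sampled_conditional_mean(lambda a: a, 50_000, rng)
print(exact, approx)
```

Because the samples are drawn from the prior, the importance weights reduce to the likelihood values, which is what allows the later circuit to use tuning curves proportional to the likelihood.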
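\[
\begin{aligned}
& P(C = 1 \mid X_1 = x_1, X_2 = x_2) \\
&= P(C = 1 \mid x_1, x_2) \\
&= \int_{S_1, S_2} P(C = 1, S_1, S_2 \mid x_1, x_2)\,dS_1\,dS_2 \\
&= \int_{S_1, S_2} P(C = 1 \mid S_1, S_2)\,P(S_1, S_2 \mid x_1, x_2)\,dS_1\,dS_2 \\
&= \frac{\int_{S_1, S_2} P(C = 1 \mid S_1, S_2)\,P(x_1, x_2 \mid S_1, S_2)\,P(S_1, S_2)\,dS_1\,dS_2}
        {\int_{S_1, S_2} P(x_1, x_2 \mid S_1, S_2)\,P(S_1, S_2)\,dS_1\,dS_2} \\
&= \frac{E\big(P(C = 1 \mid S_1, S_2)\,P(x_1, x_2 \mid S_1, S_2)\big)_{P(S_1, S_2)}}
        {E\big(P(x_1, x_2 \mid S_1, S_2)\big)_{P(S_1, S_2)}} \\
&\approx \sum_{i=1}^{N} P(C = 1 \mid S_1^i, S_2^i)\,
   \frac{P(x_1, x_2 \mid S_1^i, S_2^i)}{\sum_{j=1}^{N} P(x_1, x_2 \mid S_1^j, S_2^j)} \\
&= \sum_{i=1}^{N} \frac{P(S_1^i, S_2^i \mid C = 1)}
   {P(S_1^i, S_2^i \mid C = 1) + P(S_1^i, S_2^i \mid C = 2)}\,
   \frac{P(x_1, x_2 \mid S_1^i, S_2^i)}{\sum_{j=1}^{N} P(x_1, x_2 \mid S_1^j, S_2^j)} \\
&= \sum_{i=1}^{N} I(S_1^i = S_2^i)\,
   \frac{P(x_1, x_2 \mid S_1^i, S_2^i)}{\sum_{j=1}^{N} P(x_1, x_2 \mid S_1^j, S_2^j)},
   \qquad S_1^i, S_2^i \sim P(S_1, S_2).
\end{aligned}
\tag{3}
\]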


In equation (3), the sample $S_1^i, S_2^i$ is drawn from $P(S_1, S_2)$. We
abbreviate $X_1 = x_1, X_2 = x_2$ to $x_1, x_2$, and this convention holds in
the rest of the paper. $I(S_1^i = S_2^i)$ is an indicator function; it equals
1 only when $S_1^i = S_2^i$. The last equality holds due to the definitions
of $P(S_1, S_2 \mid C = 1)$ and $P(S_1, S_2 \mid C = 2)$. Note that equation
(3) also holds for $P(C = 1) \neq 0.5$. It is easy to see that equation (3)
retains the hierarchical structure of the Bayesian model in Fig. 2. Based on
this, a neural circuit with a hierarchical structure can be designed
corresponding to the bottom-up process of inference. We will discuss this
part in detail in the following subsection.
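The estimator in the last line of equation (3) can be sketched as below and compared against a closed-form posterior. This is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the inputs and parameters are illustrative, and the closed-form reference uses the standard result that, for this Gaussian model with equal priors, $(X_1, X_2)$ is jointly Gaussian with covariance $\sigma_s^2$ when $C = 1$ and independent when $C = 2$.

```python
import math
import random


def gauss_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sampled_posterior_common(x1, x2, sigma_s, sigma_1, sigma_2, n, rng):
    """Equation (3): normalized likelihood weights summed over samples with S1 = S2."""
    common_weight = 0.0
    total_weight = 0.0
    for _ in range(n):
        # Draw (S1, S2) from P(S1, S2) = 0.5 P(. | C=1) + 0.5 P(. | C=2).
        s1 = rng.gauss(0.0, sigma_s)
        s2 = s1 if rng.random() < 0.5 else rng.gauss(0.0, sigma_s)
        weight = gauss_pdf(x1, s1, sigma_1) * gauss_pdf(x2, s2, sigma_2)
        total_weight += weight
        if s1 == s2:  # indicator I(S1 = S2); independent draws tie with probability ~0
            common_weight += weight
    return common_weight / total_weight


def exact_posterior_common(x1, x2, sigma_s, sigma_1, sigma_2):
    """Closed-form P(C = 1 | x1, x2), assuming P(C = 1) = 0.5."""
    a = sigma_s ** 2 + sigma_1 ** 2   # Var(X1)
    b = sigma_s ** 2 + sigma_2 ** 2   # Var(X2)
    c = sigma_s ** 2                  # Cov(X1, X2) when C = 1
    det = a * b - c * c
    quad = (b * x1 * x1 - 2.0 * c * x1 * x2 + a * x2 * x2) / det
    like_common = math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))
    like_separate = gauss_pdf(x1, 0.0, math.sqrt(a)) * gauss_pdf(x2, 0.0, math.sqrt(b))
    return like_common / (like_common + like_separate)


rng = random.Random(2)
exact = exact_posterior_common(1.0, 2.0, 4.0, 6.0, 6.0)
approx = sampled_posterior_common(1.0, 2.0, 4.0, 6.0, 6.0, 100_000, rng)
print(exact, approx)
```

Only the likelihoods and an equality test are needed per sample; no logarithm or explicit marginalization appears, which is the property the circuit design relies on.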

An important index for a sampling-based algorithm is its accuracy. The
following theorem states that our algorithm converges to the accurate
value with probability one as the sample size goes to infinity. The proof of
Theorem 1 is provided in Appendix A.

Theorem 1. Let the distributions $P(C)$, $P(S_1, S_2 \mid C)$,
$P(X_1 \mid S_1)$ and $P(X_2 \mid S_2)$ be defined on the Bayesian network in
Fig. 2. If $S_1^i, S_2^i \sim P(S_1, S_2)$, then for an arbitrarily small
number $\varepsilon > 0$,

\[
\lim_{N \to \infty} P\left( \left| \sum_{i=1}^{N} P(C = 1 \mid S_1^i, S_2^i)\,
\frac{P(x_1, x_2 \mid S_1^i, S_2^i)}{\sum_{j=1}^{N} P(x_1, x_2 \mid S_1^j, S_2^j)}
- P(C = 1 \mid x_1, x_2) \right| < \varepsilon \right) = 1.
\tag{4}
\]

Fig. 3. The tuning curve of a neuron in the primary visual cortex (V1) and
its neural variability.


3.3. Implementation with Neural Circuits 

In this section, we design a neural circuit to implement sampling-based
causal inference in cue combination. According to recent studies, one of the
most widely accepted ways to implement probabilistic inference with neural
circuits is based on PPC and some plausible neural operations. Related work
includes that of Ma et al. [20], who show that the inference for cue
integration can be conducted by PPC and a plausible neural operation, linear
combination. Beck et al. implement marginalization with PPC, quadratic
nonlinearity and divisive normalization. Our method adopts the same structure
as that of Ma and Beck, and the neural circuit is designed based on PPC and
three types of plausible neural operations: multiplication, normalization
and linear combination. Here we first give a simple explanation of PPC. PPC
takes advantage of the variability in neuronal responses and considers a
population of neurons as encoders of probability distributions rather than
of the values of variables. Specifically, for independent Poisson spiking
neurons, the distribution of the responses $r = \{r_1, r_2, \ldots, r_N\}$ to
the input stimulus $S$ is
$P(r \mid S) = \prod_i \frac{e^{-f_i(S)} f_i(S)^{r_i}}{r_i!}$, where
$f_i(S)$ is the tuning curve of neuron $i$. The tuning curve is a function of
$S$ that represents the average firing rate in response to stimulus $S$ over
trials (in theory, an infinite number of trials). Fig. 3 shows an example:
the blue curve is the Gaussian-like tuning curve of a neuron in the primary
visual cortex (V1). This neuron is sensitive to the moving direction of the
stimulus. The red circles are the firing rates for different moving
directions of the stimulus in a single trial. The red circles are not always
on the blue curve because the tuning curve is the average firing rate and the
neural response has variability.

Fig. 4. The neural circuit of sampling-based method for causal inference in
cue combination.

The neural circuit for equation (3) is shown in Fig. 4. We suppose that
there are $N$ Poisson spiking neurons
$S_1^1 S_2^1, S_1^2 S_2^2, \ldots, S_1^N S_2^N$ whose states are sampled from
$P(S_1, S_2)$. The tuning curve of the neuron $S_1^i S_2^i$ is proportional
to $P(X_1, X_2 \mid S_1^i, S_2^i)$. These assumptions are reasonable, as
physiological studies [35, 36, 37] have demonstrated that the quantity of
neurons in the human brain follows some prior distributions. The Poisson
spiking neurons $S_1^1 S_2^1, S_1^2 S_2^2, \ldots, S_1^N S_2^N$ are used to
code the input stimuli $X_1$, $X_2$, and the output firing rates are
$r_1, r_2, \ldots, r_N$ respectively. The firing rates are then normalized.
Note that the normalization operation can be realized by inhibitory neurons
[38]. If we use $R$ to denote the total firing rate, where
$R = \sum_i r_i$, then we get

\[
E\!\left(\frac{r_i}{R} \,\Big|\, R = n\right)
= \frac{P(X_1, X_2 \mid S_1^i, S_2^i)}{\sum_{j} P(X_1, X_2 \mid S_1^j, S_2^j)},
\]

which is proved in [12]. The equation above means that the expectation of the
normalized firing rate of the Poisson neurons equals the normalized
probability
$P(X_1, X_2 \mid S_1^i, S_2^i) / \sum_{j} P(X_1, X_2 \mid S_1^j, S_2^j)$.
These neural activities are then fed into the third layer with synaptic
weights $w_1$ and $w_2$, where $w_1 = I(S_1^i = S_2^i)$ and
$w_2 = I(S_1^i \neq S_2^i)$. In the fourth layer, a max operation is taken to
decide whether the cause is 1 or 2. Note that the precondition of the
inference is that the prior probability and the conditional probability are
known. We suppose that the prior probability is represented by the
distribution of the Poisson spiking neurons, which means the states of the
Poisson spiking neurons follow the prior distribution. We also assume that
the tuning curves are proportional to the conditional probability. With
sampling-based inference, a massive number of neurons can sample in parallel
and calculate without iteration. This means that the neural circuit trades
space for time, so the inference can be quite rapid.

Fig. 5. The firing rate of the Poisson spiking neurons in a trial.
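The four layers described above can be mimicked in a small simulation. This is a minimal sketch and not the authors' code: the helper names, the gain of 10000, the parameter values, and the use of Knuth's multiplication method for Poisson sampling are all assumptions chosen for illustration.

```python
import math
import random


def gauss_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def poisson_sample(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_population(x1, x2, sigma_s, sigma_1, sigma_2, n_neurons, gain, rng):
    """Return (is_common, tuning_rate, spike_count) per neuron for one trial."""
    neurons = []
    for _ in range(n_neurons):
        # First layer: each neuron's state (S1, S2) is a sample from the prior.
        s1 = rng.gauss(0.0, sigma_s)
        s2 = s1 if rng.random() < 0.5 else rng.gauss(0.0, sigma_s)
        # Tuning curve proportional to the likelihood P(x1, x2 | S1, S2).
        rate = gain * gauss_pdf(x1, s1, sigma_1) * gauss_pdf(x2, s2, sigma_2)
        neurons.append((s1 == s2, rate, poisson_sample(rate, rng)))
    return neurons


rng = random.Random(3)
neurons = simulate_population(1.0, 2.0, 4.0, 6.0, 6.0, 1000, 10000.0, rng)
total_spikes = sum(count for _, _, count in neurons)   # R, used for normalization
total_rate = sum(rate for _, rate, _ in neurons)
# Third layer: weights w1 = I(S1 = S2) applied to the normalized activity.
p_from_spikes = sum(c for common, _, c in neurons if common) / total_spikes
p_from_rates = sum(r for common, r, _ in neurons if common) / total_rate
# Fourth layer: max over the two outputs (common cause vs. separate causes).
decision = 1 if p_from_spikes >= 0.5 else 2
print(p_from_spikes, p_from_rates, decision)
```

The value computed from spike counts fluctuates from trial to trial, while the value computed from the tuning rates is the noise-free quantity it approximates in expectation.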

4. Experiments 

In this section, we demonstrate the merits of the proposed method with
experiments. Samples of $X_1$ and $X_2$ are generated according to the prior
probabilities $P(C)$ and the conditional probabilities $P(S_1, S_2 \mid C)$,
$P(X_1 \mid S_1)$ and $P(X_2 \mid S_2)$ in three steps. First, we generate
samples of the variable $C$ with equal probability 0.5 for each state. Then,
for each sample $C^i$, if $C^i = 1$, $S_1^i$ is generated from a Gaussian
distribution with mean 0 and variance $\sigma_s^2$, and $S_2^i = S_1^i$. If
$C^i = 2$, $S_1^i$ and $S_2^i$ are drawn from the same Gaussian distribution
with mean 0 and variance $\sigma_s^2$. Finally, $X_1^i$ and $X_2^i$ are
generated from two different Gaussian distributions, whose means are $S_1^i$
and $S_2^i$ and whose variances are $\sigma_1^2$ and $\sigma_2^2$
respectively.
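The three generation steps can be sketched as follows. The function name `generate_trial` and the parameter values in the usage lines are hypothetical and chosen only for illustration.

```python
import random


def generate_trial(sigma_s, sigma_1, sigma_2, rng):
    """Draw one (C, S1, S2, X1, X2) tuple from the causal inference model."""
    # Step 1: the cause C takes state 1 or 2 with probability 0.5 each.
    c = 1 if rng.random() < 0.5 else 2
    # Step 2: stimuli; a common cause forces S2 = S1.
    s1 = rng.gauss(0.0, sigma_s)
    s2 = s1 if c == 1 else rng.gauss(0.0, sigma_s)
    # Step 3: noisy measurements centered on the stimuli.
    x1 = rng.gauss(s1, sigma_1)
    x2 = rng.gauss(s2, sigma_2)
    return c, s1, s2, x1, x2


rng = random.Random(4)
trials = [generate_trial(4.0, 6.0, 6.0, rng) for _ in range(20_000)]
fraction_common = sum(1 for t in trials if t[0] == 1) / len(trials)
print(fraction_common)  # should be close to 0.5
```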

4.1. Experiment 1: Simulating the Poisson spiking neurons and their firing
rates

Here we simulate the behavior of the Poisson spiking neurons in Fig. 4. We
first generate the inputs $X_1$ and $X_2$ randomly with the method described
above. The parameters are set to $\sigma_s = 4$ and
$\sigma_1 = \sigma_2 = 6$. Then we generate 1000 Poisson spiking neurons
$S_1^1 S_2^1, S_1^2 S_2^2, \ldots, S_1^{1000} S_2^{1000}$ by sampling from
$P(S_1, S_2)$. The tuning curve of the neuron $S_1^i S_2^i$ is set to
$10000 \times P(X_1, X_2 \mid S_1^i, S_2^i)$. Fig. 5 shows the firing rates
$r_1, r_2, \ldots, r_{1000}$ of the 1000 Poisson spiking neurons in a trial
(the firing rates can vary across trials due to neural variability). Note
that here we only show the results of neurons whose indexes range from 1 to
1000 with a uniform spacing of 30. In Fig. 6, the circles on the blue curve
represent the normalized firing rate of neuron $S_1^i S_2^i$, which is
$r_i / \sum_{j=1}^{1000} r_j$. Similarly, we only show the results of neurons
whose indexes range from 1 to 1000 with a uniform spacing of 30. The plus
signs on the red curve are the normalized probabilities
$P(X_1, X_2 \mid S_1^i, S_2^i) / \sum_{j=1}^{1000} P(X_1, X_2 \mid S_1^j, S_2^j)$.
We can see that the normalized firing rate is close to the normalized
probability.

Fig. 6. The normalized firing rate compared with the normalized distribution
in a trial.

4.2. Experiment 2: Testing on the convergence and accuracy of our method

Here we simulate and present the behavior of the neurons in the last two
layers and show the convergence and accuracy of our method. We first
generate 1000 inputs $X_1$ and $X_2$ randomly with the method described
above. For each input pair $X_1^i$ and $X_2^i$, $\sigma_s$, $\sigma_1$ and
$\sigma_2$ are drawn randomly from a uniform distribution on $[3, 7]$. Then
we calculate $P(C = 1 \mid X_1^i, X_2^i)$ with the sampling-based method and
denote the result by $P^{\mathrm{sample}}(C = 1 \mid X_1^i, X_2^i)$.
Meanwhile, the true value of $P(C = 1 \mid X_1^i, X_2^i)$ is denoted by
$P^{\mathrm{true}}(C = 1 \mid X_1^i, X_2^i)$, which is calculated with the
elimination method in [16]. The error for the sample $X_1^i$, $X_2^i$ is
defined by
$e_i = |P^{\mathrm{sample}}(C = 1 \mid X_1^i, X_2^i) - P^{\mathrm{true}}(C = 1 \mid X_1^i, X_2^i)|$.
This index expresses the gap between the sampled value and the optimal value
of the posterior probability. The mean error is calculated over the 1000
different inputs. The error rate is the proportion of false results among the
1000 different inputs when we infer the cause. We calculate the mean error
and the error rate with different sample sizes and repeat the experiments 10
times. The mean error of the posterior probability as a function of sample
size is shown in Fig. 7. We find that the mean error decreases as the sample
size increases and approaches zero as the sample size grows. This result
demonstrates that if we have enough samples, that is, enough neurons, our
algorithm will reach the optimal value. Fig. 8 plots the error rate obtained
with our method and with the method of Ma et al. [16]. The method of Ma et
al. does not depend on the sample size, while ours can trade accuracy against
sample size. Our method is superior to that of Ma et al. when the sample size
is larger than 500 (500 neurons for each variable). We can see that in order
to keep the error rate under 0.05 for two-stimuli causal inference, we need
at least 1000 neurons to represent each variable. This means $N = 1000$ in
equation (3).

Fig. 7. Mean error of the posterior probability varies with sample size.

Fig. 8. Error rate of causal inference.

4.3. Experiment 3: Testing on the applicability of the method with different
parameters.

Experiments 1 and 2 indicate that our inference method can reach the
optimal solution given enough samples. However, the parameters $\sigma_1$,
$\sigma_2$ and $\sigma_s$ were drawn randomly for each trial. In Experiment
3, we analyze the applicability of our method with different parameters in
more detail. In this experiment, $\sigma_1$, $\sigma_2$ and $\sigma_s$ can
vary from 1 to 8. We test the error rate for different $\sigma_1$,
$\sigma_2$ and $\sigma_s$ with the sample size being 1000 and show the result
in Fig. 9. In each sub-figure, one parameter is set to a fixed value while
the other two vary from 1 to 8. We can see that the error rate is less than
0.1 for most of the parameters. However, when $\sigma_s$ is close to zero,
the error rate turns out to be very high. This can be explained as follows.
Since $\sigma_s$ is close to zero, the difference between $S_1$ and $S_2$
remains very small regardless of whether $C = 1$ or $C = 2$. Then the
posterior probabilities of a common cause and of two different causes will
both be close to 0.5. As a result, a very small error can lead to an
incorrect inference, making the error rate very high. Nevertheless, if there
are adequate samples, the error rate can be made arbitrarily small, which
means our method is robust to different parameters.

4.4. Experiment 4: Testing on the probability of reporting a common cause
with respect to stimulus disparity.

Stimulus disparity refers to the spatial difference between stimuli, which
is defined by $S_2 - S_1$, where $S_1$ and $S_2$ are two different stimuli.
Intuitively, it is more likely that there is a common cause if the stimulus
disparity is small, and that there are two different causes if the stimulus
disparity is large. In this experiment, 200000 samples are generated with
parameters $\sigma_1 = 3$ and $\sigma_s = \sigma_2 = 10$. For each $S_1^i$
and $S_2^i$, the stimulus disparity is defined by $S_2^i - S_1^i$. The state
of the variable $C^i$ is inferred by the optimal equation [25] and by our
method respectively. Then, for the samples with the same stimulus disparity,
we calculate the proportion of reports of a common cause. Fig. 10 shows the
result: the red curve is obtained with the optimal equation, and the black,
green and blue curves are calculated by our method with sample sizes of 100,
300 and 1000 respectively. The result shows that as the sample size becomes
larger, the sampling-based curve moves closer to the exact curve. When the
sample size equals 1000, the sampling-based curve is almost the same as the
exact curve. This result indicates the accuracy of our method.

Fig. 9. Error rate for different parameters with sample size equal to 1000.

Fig. 10. Comparison of the proportion of reports of C = 1 with respect to
stimulus disparity between the exact equation and our sampling-based method.

5. Generalization 

In this section, we generalize our sampling-based method to implement
inference for two other important problems: multi-stimuli causal inference
in cue combination and the same-different judgment.

Fig. 11. The Bayesian model for the multi-stimuli causal inference.


5.1. Multi-stimuli causal inference 

In the above experiments, only situations with two stimuli were considered,
while in daily life cues may come from multiple sensory modalities, such as
visual, auditory and tactile. Although the Bayesian model can also explain
the causal inference problem with multiple stimuli [26], how to implement the
inference with neural circuits is unclear. With our sampling-based method, it
is easy to generalize the neural circuits to implement inference for multiple
stimuli.

The steps are similar to those in Section 3. First we convert the problem
to a Bayesian network (shown in Fig. 11). The prior probabilities and
conditional probabilities are defined according to the causal inference
model, where different states of $C$ reflect different situations of the
cause. The causal inference problem is then converted to the inference of a
posterior probability, which can be calculated by importance sampling:





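\[
\begin{aligned}
& P(C = 1 \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) \\
&= P(C = 1 \mid x_1, x_2, \ldots, x_n) \\
&= \int_{S_1, \ldots, S_n} P(C = 1, S_1, \ldots, S_n \mid x_1, \ldots, x_n)\,dS_1 \cdots dS_n \\
&= \int_{S_1, \ldots, S_n} P(C = 1 \mid S_1, \ldots, S_n)\,
   P(S_1, \ldots, S_n \mid x_1, \ldots, x_n)\,dS_1 \cdots dS_n \\
&= \frac{\int_{S_1, \ldots, S_n} P(C = 1 \mid S_1, \ldots, S_n)\,
   P(x_1, \ldots, x_n \mid S_1, \ldots, S_n)\,P(S_1, \ldots, S_n)\,dS_1 \cdots dS_n}
  {\int_{S_1, \ldots, S_n} P(x_1, \ldots, x_n \mid S_1, \ldots, S_n)\,
   P(S_1, \ldots, S_n)\,dS_1 \cdots dS_n} \\
&= \frac{E\big(P(C = 1 \mid S_1, \ldots, S_n)\,
   P(x_1, \ldots, x_n \mid S_1, \ldots, S_n)\big)_{P(S_1, \ldots, S_n)}}
  {E\big(P(x_1, \ldots, x_n \mid S_1, \ldots, S_n)\big)_{P(S_1, \ldots, S_n)}} \\
&\approx \sum_{i=1}^{N} I(S_1^i = S_2^i = \cdots = S_n^i)\,
   \frac{P(x_1, \ldots, x_n \mid S_1^i, \ldots, S_n^i)}
        {\sum_{j=1}^{N} P(x_1, \ldots, x_n \mid S_1^j, \ldots, S_n^j)},
   \qquad S_1^i, \ldots, S_n^i \sim P(S_1, \ldots, S_n).
\end{aligned}
\tag{5}
\]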

Equation (5) is similar to (3) except that the stimuli here are
$S_1, S_2, \ldots, S_n$ rather than $S_1, S_2$. According to (5), the neural
circuit for multi-stimuli causal inference is similar to that for two-stimuli
causal inference except for three differences. Firstly, the states of the
Poisson spiking neurons are sampled from $P(S_1, S_2, \ldots, S_n)$ rather
than $P(S_1, S_2)$. Secondly, the tuning curve of the neuron marked as $i$ is
proportional to $P(x_1, x_2, \ldots, x_n \mid S_1^i, S_2^i, \ldots, S_n^i)$.
Thirdly, the synaptic weights are $I(S_1^i = S_2^i = \cdots = S_n^i)$ instead
of $I(S_1^i = S_2^i)$.
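The three changes listed above amount to small edits of the two-stimuli estimator, as the following sketch suggests. It is an illustration under assumptions, not the authors' implementation: function names and parameter values are hypothetical, "diverse causes" is modeled as fully independent stimuli with equal priors on $C$, and the reference value is obtained by a one-dimensional trapezoid rule rather than by any method from the paper.

```python
import math
import random


def gauss_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sampled_posterior_common(xs, sigma_s, sigmas, n, rng):
    """Sampling estimate with weights I(S_1 = ... = S_n) on normalized likelihoods."""
    common_weight = 0.0
    total_weight = 0.0
    for _ in range(n):
        s_first = rng.gauss(0.0, sigma_s)
        is_common = rng.random() < 0.5   # which mixture component produced the sample
        weight = 1.0
        for k, (x, sigma) in enumerate(zip(xs, sigmas)):
            s = s_first if (is_common or k == 0) else rng.gauss(0.0, sigma_s)
            weight *= gauss_pdf(x, s, sigma)
        total_weight += weight
        if is_common:                    # equivalent to the all-equal indicator
            common_weight += weight
    return common_weight / total_weight


def reference_posterior_common(xs, sigma_s, sigmas, grid=4001, span=10.0):
    """Reference value: trapezoid rule over S for C = 1, closed form for C = 2."""
    lo, hi = -span * sigma_s, span * sigma_s
    step = (hi - lo) / (grid - 1)
    like_common = 0.0
    for k in range(grid):
        s = lo + k * step
        value = gauss_pdf(s, 0.0, sigma_s)
        for x, sigma in zip(xs, sigmas):
            value *= gauss_pdf(x, s, sigma)
        like_common += value * (0.5 if k in (0, grid - 1) else 1.0) * step
    like_separate = 1.0
    for x, sigma in zip(xs, sigmas):
        like_separate *= gauss_pdf(x, 0.0, math.sqrt(sigma_s ** 2 + sigma ** 2))
    return like_common / (like_common + like_separate)


rng = random.Random(5)
xs, sigmas = [1.0, 2.0, 1.5], [6.0, 6.0, 6.0]
reference = reference_posterior_common(xs, 4.0, sigmas)
approx = sampled_posterior_common(xs, 4.0, sigmas, 100_000, rng)
print(reference, approx)
```

Note that the per-sample work grows only linearly with the number of stimuli, since each additional stimulus contributes one more likelihood factor.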

We also test the convergence and accuracy of our method for multi-stimuli
causal inference to show its feasibility. Three-stimuli causal inference is
tested because the situation is quite common in daily life, such as the
integration of visual, auditory and tactile input. We also test ten stimuli
to show that our method applies to higher dimensions. We assume that
$P(C = 1) = P(C = 2) = 0.5$. Note that $C = 1$ means all the cues have the
same cause and $C = 2$ means the cues have diverse causes. We define

\[
P(S_1, S_2, \ldots, S_t \mid C = 1) = \frac{1}{\sqrt{2\pi}\,\sigma_s}
\exp\!\left(-\frac{S_1^2}{2\sigma_s^2}\right) \prod_{j=2}^{t} \delta(S_1 - S_j)
\]

and

\[
P(S_1, S_2, \ldots, S_t \mid C = 2) = \frac{1}{(2\pi\sigma_s^2)^{t/2}}
\exp\!\left(-\frac{S_1^2 + S_2^2 + \cdots + S_t^2}{2\sigma_s^2}\right).
\]

Besides, the conditional probability of $X_i$ given $S_i$
($i = 1, 2, \ldots, t$) is defined by

\[
P(X_i \mid S_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}
\exp\!\left(-\frac{(X_i - S_i)^2}{2\sigma_i^2}\right).
\]

The experimental procedure is similar to that of Experiment 2 in Section 4,
and the result is shown in Fig. 12. We find that the error rate decreases as
the sample size increases and approaches zero as the sample size grows. We
also find that the number of samples does not need to be scaled up as the
dimension becomes higher: 1000 samples (neurons) per variable are required
to keep the error rate under 0.05. These results are in good agreement with
the view that the sample size required by importance sampling does not scale
up with the dimension.

Fig. 12. Error rate of the multi-stimuli causal inference.

5.2. Same-different judgment 

When faced with multiple objects, probably the first thing our brain needs
to do is to decide whether they are the same or not. Thus the same-different
judgment could be critical in perception and cognition. A straightforward
example is object classification: human brains are able to recognize
instances of the same object and assign them to semantic classes. Berg et al.
propose the optimal-observer model and prove that the same-different
judgment is a process of probabilistic inference. As illustrated in Fig. 13,
variable C represents the judgment: C = 1 means the objects are the same
while C = 2 means they are different. $\mu$ is a single-valued parameter
generated from a uniform distribution ranging from $-L$ to $L$. When the
objects are the same, $S_i$ equals $\mu$. When they are different, $S_i$ is
drawn from a Gaussian distribution with mean $\mu$ and variance $\sigma_s^2$.
The distribution of $X_i$ is a Gaussian distribution with mean $S_i$ and
variance $\sigma_i^2$. Based on these definitions, the same-different
judgment problem can be converted to the problem of inferring the posterior
probability of variable C, which can be calculated by importance sampling:

$$P(C = 1 \mid x_1, \ldots, x_t) \approx \frac{\sum_{i=1}^{N} P(x_1, \ldots, x_t \mid S_1^i, \ldots, S_t^i)\, I(-L < S_1^i = S_2^i = \cdots = S_t^i < L)}{\sum_{i=1}^{N} P(x_1, \ldots, x_t \mid S_1^i, \ldots, S_t^i)} \qquad (6)$$

In equation (6), $I(-L < S_1^i = S_2^i = \cdots = S_t^i < L)$ is an indicator
function. It equals 1 only when all the stimuli are the same and lie between
$-L$ and $L$. Equation (6) differs from the corresponding equation for
multi-stimuli causal inference only in the indicator function. Due to this,
the neural circuit for the same-different judgment is similar to that for
multi-stimuli causal inference, with one difference: the synaptic weights of
the neural circuit for the same-different judgment are
$I(-L < S_1^i = S_2^i = \cdots = S_t^i < L)$ instead of
$I(S_1^i = S_2^i = \cdots = S_t^i)$.

Fig. 13. Optimal-observer model of the same-different judgment problem.

Fig. 14. Error rate of the same-different judgment.
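The generative model and the indicator-weighted estimator described for the
same-different judgment can be sketched as follows. This is a hedged
illustration rather than the authors' code: it assumes NumPy, a prior
probability of 0.5 for "same", and hypothetical names and defaults
(`sample_stimuli`, `posterior_same`, `L`, `sigma_s`). The only change from a
plain common-cause estimator is the indicator, which also checks that the
shared stimulus lies between -L and L.

```python
import numpy as np

def sample_stimuli(n, t, L, sigma_s, rng, p_same=0.5):
    """Draw n stimulus vectors (S_1..S_t) from the optimal-observer prior."""
    mu = rng.uniform(-L, L, size=(n, 1))                 # mu ~ U(-L, L)
    s = mu + sigma_s * rng.standard_normal((n, t))       # C = 2: S_i ~ N(mu, sigma_s^2)
    same = rng.random(n) < p_same
    s[same] = mu[same]                                   # C = 1: every S_i equals mu
    return s

def posterior_same(x, sigma, L=10.0, sigma_s=2.0, n_samples=5000, rng=None):
    """Importance-sampling estimate of P(C = 1 | x_1, ..., x_t)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    s = sample_stimuli(n_samples, x.size, L, sigma_s, rng)

    # Likelihood weights in log space; shared constants cancel in the ratio.
    log_w = -0.5 * np.sum(((x - s) / sigma) ** 2, axis=1)
    w = np.exp(log_w - log_w.max())

    # Indicator I(-L < S_1 = ... = S_t < L).
    indicator = np.all(s == s[:, :1], axis=1) & (np.abs(s[:, 0]) < L)
    return float(np.sum(w * indicator) / np.sum(w))
```

For example, `posterior_same([2.0, 2.1, 1.9], [1.0, 1.0, 1.0])` estimates the
probability that three similar measurements come from identical objects. In
the circuit interpretation, the indicator values play the role of the
synaptic weights.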

We test the accuracy and convergence of our method for the same-different
judgment problem. In this experiment, we compute the error rate of the
same-different judgment with three objects and with ten objects. Samples of
$X_1, X_2, \ldots, X_t$ are generated by a method similar to that in the
section on multi-stimuli causal inference. Note that $\mu$ is generated from
the uniform distribution ranging from $-10$ to $10$, and $\sigma_s$ and
$\sigma_i$ are drawn randomly from a uniform distribution on [1, 3]. The
inference result of our method is presented in Fig. 14 and is similar to that
of Fig. 12. We find that the error rates of the same-different judgment with
three objects and with ten objects both decrease as the sample size increases
and converge to zero as the sample size tends to infinity. Besides, the
number of samples needed does not scale up as the dimension becomes higher:
5000 samples (neurons) are required for each variable to keep the error rate
under 0.05.


6. Conclusion 

In this paper, we propose an inference algorithm for causal inference in
cue combination based on importance sampling and design a corresponding
neural circuit to implement it. The neural circuit is plausible, as it is
based on PPC and three types of plausible neural operations. Theoretical
analysis and experimental results show that our algorithm converges to the
accurate value as the sample size goes to infinity. It is worth noting that
our method also provides a general solution to two other important problems,
namely multi-stimuli causal inference and the same-different judgment.

Different from Markov chain Monte Carlo, which represents the distribution
with the variability of a neuron over time, our method utilizes the
variability across neurons to represent the distribution. This means that a
massive number of neurons can sample in parallel and compute without
iteration, so the inference can be quite rapid.

Despite the plausible neural implementation of inference, the question of
how to learn the prior probabilities and conditional probabilities with
learning rules found in biological studies requires considerable future work.
Besides, learning and inference should be implemented by the same neural
circuits. Some recent works offer useful guidance for implementing learning.
For example, Maass et al. prove that Spike-Timing-Dependent Plasticity (STDP)
is able to approximate a parameter estimation algorithm, the
expectation-maximization (EM) algorithm. This principle may be used to solve
the learning problem in our paper.


Proof of Theorem 1 


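Lemma 1. Suppose that random variables $X^1, X^2, \ldots, X^N$ are pairwise
independent and $X^i \sim P(X)$. Similarly, $Y^1, Y^2, \ldots, Y^N$ are
pairwise independent and $Y^i \sim P(Y)$. Besides, $E(X) = \mu_1$,
$E(Y) = \mu_2$, $\mu_1, \mu_2 \neq 0$, $Var(X) = \sigma_1^2$ and
$Var(Y) = \sigma_2^2$. Then for an arbitrarily small number
$\varepsilon > 0$, we can conclude that

$$P\left(\left|\frac{\sum_{i=1}^{N} X^i}{\sum_{j=1}^{N} Y^j} - \frac{\mu_1}{\mu_2}\right| < \varepsilon\right) > 1 - \frac{16\sigma_1^2}{N\varepsilon^2\mu_2^2} - \frac{16\mu_1^2\sigma_2^2}{N\varepsilon^2\mu_2^4}.$$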


Proof: As random variables $X^1, X^2, \ldots, X^N$ are pairwise independent
with $X^i \sim P(X)$, and $Y^1, Y^2, \ldots, Y^N$ are pairwise independent
with $Y^i \sim P(Y)$, we can get

$$E\left(\frac{1}{N}\sum_{i=1}^{N} X^i\right) = \frac{1}{N}\sum_{i=1}^{N} E(X^i) = \mu_1, \qquad E\left(\frac{1}{N}\sum_{j=1}^{N} Y^j\right) = \frac{1}{N}\sum_{j=1}^{N} E(Y^j) = \mu_2.$$

For arbitrarily small numbers $\varepsilon_1$ and $\varepsilon_2$, an
application of Chebyshev's inequality yields the inequality:
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An application of Taylor’s formula yields: 
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The equations above indicate that 
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Next, we rewrite the equation above as 
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Then for arbitrary small number e, we have 
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We conclude that for arbitrary small number s, 
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Lemma 2. Supposing that random variables are inde¬ 

pendent pairwise and X® ~ P (X).Besides, we also know that P (X) = /xi, 
/ii 7 ^ 0, l^ar (X) = a^. Then for arbitrary small number e, 
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N 

N 2^ ^ 



> 1 


N 




Proof: The proof is similar to that of Lemma 1. 
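
The qualitative content of Lemma 1 (the ratio of two sample sums concentrates
around the ratio of the means as N grows) can be checked numerically. The
following Python sketch is only an illustration under assumed settings: it
uses NumPy, Gaussian variables with hypothetical means and standard
deviations, and a hypothetical function name. It estimates the probability
that the ratio lies within epsilon of mu_1 / mu_2 and does not verify the
exact constants of the bound.

```python
import numpy as np

def ratio_within_eps(n, eps, mu1=2.0, mu2=4.0, sd1=1.0, sd2=1.0,
                     trials=2000, rng=None):
    """Empirical P(|sum(X)/sum(Y) - mu1/mu2| < eps) for sample size n."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(mu1, sd1, size=(trials, n))   # X^1..X^n for each trial
    y = rng.normal(mu2, sd2, size=(trials, n))   # Y^1..Y^n for each trial
    ratio = x.sum(axis=1) / y.sum(axis=1)
    return float(np.mean(np.abs(ratio - mu1 / mu2) < eps))

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    # The estimated probability should approach 1 as n increases.
    print(n, ratio_within_eps(n, eps=0.05, rng=rng))
```

The printed probabilities increase with n, in line with a bound whose
correction terms shrink in proportion to 1/N.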

Theorem 1. The distributions $P(C)$, $P(S_1, S_2 \mid C)$, $P(X_1 \mid S_1)$
and $P(X_2 \mid S_2)$ are defined on the Bayesian network of the causal
inference model. If $S_1^i, S_2^i \sim P(S_1, S_2)$, then for
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arbitrary small number e, 
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Proof: Supposing that 
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Note that the variance is denoted as af. It is easy to use Lemma 2 to show 
that for arbitrary small number e, 
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Note that the variance is denoted as here. Since ^\^u^ 2 )Pixi,x 2 ) _ 
$P(C = 1 \mid x_1, x_2)$, it is easy to use Lemma 1 to show that for an
arbitrarily small number $\varepsilon$,

p (I/2 ixi,X 2 ) -PiC = llxi, X 2 )\<e)>l- ^ 

We also know that 

P{\fi{xuX2)-f2{xuX2)\<2e)> (1 - ’ 
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This term holds as P {A D B) > P (A) + P (B) — 1. Then for arbitrary small 
number e, we have 
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When N goes to inhnite, we calculate the limits for both sides and conclude 
that, for arbitrary small number e, 
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